In this document, we analyse missing data patterns in the dataset. The dataset has been obtained as described in the document “Download subsample of pollution data” (ie using europollution package and downloading data for a random sample of 10% of the stations, from 2013 to 2018). In a first step, we focus on one pollutant only, PM10. The present document is build so that it can be easily modified and recomputed for other pollutants or another sample.
Location of measurement stations
In order to assess by guesswork the representativity of the subsample drawn, we look at the location of the station selected:

The sample seems to cover most of Europe and should therefore be somehow representative. Note that the previous map does not represent the full set of stations. For readability reasons, we considered a zoomed version of the map. For more details, one can refer to the following interactive map.
We also consider the number of stations per country

Missing data patterns in the subsample
Here, we investigate whether missing PM10 concentration data varies across different dimentions.
Across countries
The overall share of missing value is 0.3781124.
We can break this down by countries.

One can notice that the share of missing values varies drastically across countries. This can be due to the quality of monitoring of air pollution stations. This variation might also come from our data wrangling. We need to check that.
Across dates

One may notice that the share of missing data slightly decreased over time. This might be due to improvement in air measurement capacity.
We look at whether this variation comes from some countries more than others.

We now investigate whether the share of missing data varies across month of the year.



Across time of the day
---
title: "Analysis of missing data patterns"
output: 
  html_notebook: 
    toc: no
    theme: simplex
    highlight: pygments
  html_document:
    theme: simplex
    highlight: pygments
author: "Vincent Bagilet, Léo Zabrocki"
date: "July 30, 2020"
---

```{r setup, include=FALSE, results='hide', warning=FALSE}
library(knitr)
opts_chunk$set(fig.path = "images/",
               cache.path = "cache/",
               cache = FALSE,
               echo = TRUE,
               message = FALSE,
               warning = FALSE)  
```  

```{r include=FALSE}
library(tidyverse)
library(europollution)
library(maps)
library(lubridate)
library(leaflet)

theme_set(theme_minimal())

load("../Outputs/metadata.Rda")
load("../Outputs/subsample_stations_ids.Rda")

wrangled_metadata <- metadata %>% 
  ep_wrangle_metadata() %>% 
  mutate(
    in_subsample = (station_id %in% subsample_stations_ids)
  ) %>% 
  select(-opening_station, -closing_station,-measurement_type) %>% #I need to withdraw that and fix it in europollution
  distinct() 

load("../Outputs/subsample_PM10.Rda")

data <- subsample_PM10 %>% 
  group_by(station_id, pollutant) %>% 
  mutate(
    concentration = ifelse(minute(lead(date)) != 0 & is.na(concentration), lead(concentration), concentration)
  ) %>% 
  ungroup() %>% 
  filter(minute(date) == 0) %>% #I need to fix this in europollution 
  left_join(wrangled_metadata, by = c("station_id", "country_iso", "sampling_point"))

rm(subsample_PM10)
```

In this document, we analyse missing data patterns in the dataset. The dataset has been obtained as described in the document "Download subsample of pollution data" (*ie* using `europollution` package and downloading data for a random sample of 10% of the stations, from 2013 to 2018). In a first step, we focus on one pollutant only, PM10. The present document is build so that it can be easily modified and recomputed for other pollutants or another sample.

# Location of measurement stations

In order to assess by guesswork the representativity of the subsample drawn, we look at the location of the station selected:

```{r}

map_raster <- map_data("world")

wrangled_metadata %>% 
  filter(longitude < 47 & longitude > -25 & latitude > 35 & latitude < 71) %>% 
  ggplot() +
  geom_point(aes(x = longitude, y = latitude, color = in_subsample), size = .05) +
  # geom_polygon(data = map_raster, 
  #                        aes(x = long, y = lat, group=group), 
  #                        color = "grey", fill = NA) +
  coord_map(projection = "azequidistant") +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Location of stations present in the random subsample considered", x = "", y = "", color = "Station in sample") 
```

The sample seems to cover most of Europe and should therefore be somehow representative. Note that the previous map does not represent the full set of stations. For readability reasons, we considered a zoomed version of the map. For more details, one can refer to the following interactive map. 

```{r}
pal <- colorFactor("Dark2", domain = wrangled_metadata$in_subsample) 
color_in_subsample <- pal(wrangled_metadata$in_subsample)
content <- paste("Station type:", wrangled_metadata$station_type, "<br/>",
                 "Station area:", wrangled_metadata$station_area, "<br/>")

# wrangled_metadata %>% 
#   select(station_id, longitude,latitude, in_subsample, station_area, station_type) %>% 
#   distinct() %>% 
#   leaflet() %>%
#   addTiles() %>%    
#   addProviderTiles("Esri.WorldGrayCanvas") %>%
#   addCircles(col = color_in_subsample, popup = content, lng = ~longitude, lat = ~latitude) %>%
#   addLegend(pal = pal, values = ~wrangled_metadata$in_subsample , title = "Station in sample")
```

We also consider the number of stations per country

```{r}
nb_station_country <- wrangled_metadata %>% 
  group_by(country_iso) %>% 
  summarise(
    nb_station_country = n(),
    prop_in_country_whole = n()/nrow(wrangled_metadata),
    prop_in_country_subsample = sum(in_subsample)/sum(wrangled_metadata$in_subsample)
  )

nb_station_country %>% 
  kable(col.names = c("Country ISO", "Number of stations in country", "Proportion of stations located in this country", "Proportion of stations in the subsample located in this country"))

nb_station_country %>% 
  pivot_longer(starts_with("prop_in"), names_to = "in_sample", values_to = "prop") %>% 
  mutate(in_sample = str_sub(in_sample, 17, nchar(in_sample))) %>% 
  ggplot() + 
  geom_point(aes(x = country_iso, y = prop, color = in_sample)) +
  scale_color_brewer(palette = "Dark2") +
  labs(title = "Proportion of stations by country in overall sample and subsample", x = "Country", y = "Proportion of stations", color = "Proportion in") 
```


# Missing data patterns in the subsample

Here, we investigate whether missing PM10 concentration data varies across different dimentions.

## Across countries

```{r include = FALSE}
share_missing <-  sum(is.na(data$concentration))/nrow(data)
```

The overall share of missing value is `r share_missing`.

We can break this down by countries.

```{r}
missing_share_country <- data %>% 
  group_by(country_iso) %>% 
  summarise(share_missing = sum(is.na(concentration))/n()) %>% 
  arrange(share_missing)

missing_share_country %>% 
  kable()

missing_share_country %>% 
  ggplot() + 
  geom_point(aes(x = reorder(country_iso, share_missing), y = share_missing), color = "darkBlue") + 
  labs(title = "Share of missing PM10 concentration data by country", x = "Country", y = "Share of missing PM10 concentration data") 
```

One can notice that the share of missing values varies drastically across countries. This can be due to the quality of monitoring of air pollution stations. This variation might also come from our data wrangling. We need to check that.

## Across dates

```{r}
data %>% 
  filter(year(date) > 2012) %>% #I need to investigate why I have data for 2012
  mutate(grouping_var = year(date)) %>% 
  group_by(grouping_var) %>% 
  summarise(share_missing = sum(is.na(concentration))/n()) %>% 
  arrange(share_missing) %>% 
  ggplot() + 
  geom_point(aes(x = grouping_var, y = share_missing), color = "darkBlue") + 
  labs(title = "Share of missing PM10 concentration data by year", x = "Year", y = "Share of missing PM10 concentration data") +
  ylim(0, 1)
```

One may notice that the share of missing data slightly decreased over time. This might be due to improvement in air measurement capacity.

We look at whether this variation comes from some countries more than others.

```{r}
data %>% 
  mutate(year = year(date)) %>% 
  filter(year > 2012) %>% #I need to investigate why I have data for 2012
  group_by(country_iso, year) %>% 
  summarise(share_missing = sum(is.na(concentration))/n()) %>% 
  arrange(share_missing) %>% 
  ggplot() + 
  geom_point(aes(x = country_iso, y = share_missing, color = factor(year))) + 
  scale_color_brewer(palette = "Spectral") +
  labs(title = "Share of missing PM10 concentration data by country", x = "Country", y =  "Share of missing PM10 concentration data") +
  ylim(0, 1)
```

We now investigate whether the share of missing data varies across month of the year.

```{r}
data %>% 
  filter(year(date) > 2012) %>% #I need to investigate why I have data for 2012
  mutate(grouping_var = month(date)) %>% 
  group_by(grouping_var) %>% 
  summarise(share_missing = sum(is.na(concentration))/n()) %>% 
  arrange(share_missing) %>% 
  ggplot() + 
  geom_point(aes(x = grouping_var, y = share_missing), color = "darkBlue") + 
  labs(title = "Share of missing PM10 concentration data by month", x = "Month", y = "Share of missing PM10 concentration data") +
  ylim(0, 1)
```

```{r}
data %>% 
  filter(year(date) > 2012) %>% #I need to investigate why I have data for 2012
  mutate(grouping_var = day(date)) %>% 
  group_by(grouping_var) %>% 
  summarise(share_missing = sum(is.na(concentration))/n()) %>% 
  arrange(share_missing) %>% 
  ggplot() + 
  geom_point(aes(x = grouping_var, y = share_missing), color = "darkBlue") + 
  labs(title = "Share of missing PM10 concentration data by day of the month", x= "Day of the month", y = "Share of missing PM10 concentration data") +
  ylim(0, 1)
```


```{r}
data %>% 
  filter(year(date) > 2012) %>% #I need to investigate why I have data for 2012
  mutate(grouping_var = wday(date)) %>% 
  group_by(grouping_var) %>% 
  summarise(share_missing = sum(is.na(concentration))/n()) %>% 
  arrange(share_missing) %>% 
  ggplot() + 
  geom_point(aes(x = grouping_var, y = share_missing), color = "darkBlue") + 
  labs(title = "Share of missing PM10 concentration data by day of the month", x = "Day of the week", y = "Share of missing PM10 concentration data") +
  ylim(0, 1)
```


## Across time of the day







